We introduce a class of neural networks derived from probabilistic models in the form of Bayesian 
belief networks. By imposing additional assumptions about the nature of the probabilistic models 
represented in the belief networks, we derive neural networks with standard dynamics that require 
no training to determine the synaptic weights, that can pool multiple sources of evidence, and that 
deal cleanly and consistently with inconsistent or contradictory evidence. The presented neural 
networks capture many properties of Bayesian belief networks, providing distributed versions of 
probabilistic models. 



I. INTRODUCTION 

Strong feedforward, feedback, and lateral connections
exist between distinct areas of the cerebral cortex, but
such connections are not observed in cerebellar, sensory,
or motor output circuits. The anatomical structure of
the cerebral cortex may facilitate a modular approach
to solving complex problems, with different cortical
areas being specialized for different information-processing
tasks. To permit a modular strategy of this sort,
coordinated and efficient routing of information must be
maintained between modules, which in turn demands
extensive connections throughout the cortex.

It has been proposed that cortical circuits perform
statistical inference, encoding and processing information
about analog variables in the form of probability density
functions (PDFs). This hypothesis provides a theoretical
framework for understanding diverse results of
neurobiological experiments, and a practical framework for the
construction of recurrent neural network models that
implement a broad variety of information-processing
functions.

Probabilistic formulations of neural information processing
have been explored along a number of avenues.
One of the earliest such analyses showed that the original
Hopfield neural network implements, in effect, Bayesian
inference on analog quantities in terms of PDFs. As
in the present work, Zemel et al. have investigated
population coding of probability distributions, but with
different representations and dynamics than those we will
consider here. Several extensions of this representation
scheme have been developed that feature information
propagation between interacting neural populations.
Additionally, several "stochastic machines" have been
formulated, including Boltzmann machines, sigmoid belief
networks, and Helmholtz machines. Stochastic machines are
built of stochastic neurons that occupy one of two possible
states in a probabilistic manner. Learning rules for
stochastic machines enable such systems to model the
underlying probability distribution of a given data set.

The putative modular nature of cortical processing fits
well in such a probabilistic framework. Cortical areas
collectively represent the joint PDF over several variables.
These neural "problem-solving modules" can be mapped
in a relatively direct fashion onto the nodes of a Bayesian
belief network, giving rise to a class of neural network
models that we have termed neural belief networks (NBNs).

In contrast, recent work based on population-temporal
coding indicates that the modeling of low-level
sensory processing and output motor control does not
require such a sophisticated representation: manipulation
of mean values instead of PDFs is generally sufficient.
Further, the representations can be simplified to deal
with vector spaces describing the mean values instead of
function spaces describing the probability density
functions.

In this work, we develop neural networks processing
mean values of analog variables as a specialized form of
the more general neural belief networks. We begin with a
brief summary of the key relevant properties of Bayesian
belief networks in section II. We describe a procedure
for generating and evaluating the neural networks in
section III and apply the procedure to several examples in
section IV.

II. BAYESIAN BELIEF NETWORKS 



Electronic address: mjb@uma.pt



Bayesian belief networks (BBNs) are directed acyclic
graphs that represent probabilistic models (Fig. 1). Each
node represents a random variable, and the arcs signify
the presence of direct causal influences between the linked
variables. The strengths of these influences are defined
using conditional probabilities. The direction of a
particular link indicates the direction of causality (or, more
simply, relevance); an arc points from cause to effect.

Multiple sources of evidence about the random vari- 
ables are conveniently handled using BBNs. The be- 
lief, or degree of confidence, in particular values of the 
random variables is determined as the likelihood of the 
value given evidentiary support provided to the network. 
There are two types of support that arise from the evi- 
dence: predictive support, which propagates from cause 
to effect along the direction of the arc, and retrospective 
support, which propagates from effect to cause, opposite 
to the direction of the arc. 

Bayesian belief networks have two properties that we 
will find very useful, both of which stem from the depen- 
dence relations shown by the graph structure. First, the 
value of a node X is not dependent upon all of the other 
graph nodes. Rather, it depends only on a subset of the 
nodes, called a Markov blanket of X, that separates node 
X from all the other nodes in the graph. The Markov 
blanket of interest to us is readily determined from the 
graph structure. It is comprised of the union of the direct 
parents of X, the direct successors of X, and all direct 
parents of the direct successors of X. Second, the joint 
probability over the random variables is decomposable as 



$$P(x_1, x_2, \ldots, x_n) = \prod_{\beta} P(x_\beta \mid \mathrm{Pa}(x_\beta)) , \tag{1}$$

where $\mathrm{Pa}(x_\beta)$ denotes the (possibly empty) set of
direct-parent nodes of $X_\beta$. This decomposition comes about
from repeated application of Bayes' rule and from the
structure of the graph.
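
The Markov blanket and the decomposition in (1) can both be read directly off the parent lists of a graph. The following is a minimal sketch, not part of the original derivation: the dictionary `PARENTS` and the helper names `children`, `markov_blanket`, and `factorization` are hypothetical, and the five-node structure is an assumption inferred from the indices of the coefficients listed in Table I.

```python
# Hypothetical parent lists for a five-node graph. The structure is an
# assumption inferred from the coefficient indices in Table I
# (K31, K32, K42, K43, K53, K54), not data supplied by the text.
PARENTS = {
    "X1": [],
    "X2": [],
    "X3": ["X1", "X2"],
    "X4": ["X2", "X3"],
    "X5": ["X3", "X4"],
}


def children(node, parents):
    """Return the direct successors of `node`."""
    return [child for child, pa in parents.items() if node in pa]


def markov_blanket(node, parents):
    """Union of parents, children, and the parents of the children."""
    blanket = set(parents[node])
    for child in children(node, parents):
        blanket.add(child)
        blanket.update(parents[child])
    blanket.discard(node)  # a node does not belong to its own blanket
    return blanket


def factorization(parents):
    """Write the joint distribution as a product of conditionals, as in (1)."""
    factors = []
    for node, pa in parents.items():
        given = " | " + ", ".join(pa) if pa else ""
        factors.append(f"P({node}{given})")
    return " ".join(factors)


print(sorted(markov_blanket("X5", PARENTS)))
print(sorted(markov_blanket("X3", PARENTS)))
print(factorization(PARENTS))
```

Under the assumed structure, the blanket of X5 contains only X3 and X4, which is the conditional independence statement made in the caption of Fig. 1.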



III. MEAN-VALUE NEURAL BELIEF NETWORKS

We will develop neural networks from the set of
marginal distributions $\{\rho(x_\alpha; t)\}$ so as to best match a
desired probabilistic model $p(x_1, x_2, \ldots, x_D)$ over the set
of random variables, which are organized as a BBN. One
or more of the variables must be specified as evidence
in the BBN. To facilitate the development of general
update rules, we do not distinguish between evidence and
non-evidence nodes in our notation.

Our general approach will be to minimize the difference
between a probabilistic model $p(x_1, x_2, \ldots, x_D)$ and an
estimate $\rho(x_1, x_2, \ldots, x_D)$ of the probabilistic model. For
the estimate, we utilize

$$\rho(x_1, x_2, \ldots, x_D) = \prod_{\alpha} \rho(x_\alpha; t) . \tag{2}$$

This is a so-called naive estimate, wherein the random
variables are assumed to be independent. We will place
further constraints on the probabilistic model and
representation to produce neural networks with the desired
dynamics.

The first assumption we make is that the populations
of neurons only need to accurately encode the mean values
of the random variables, rather than the complete
PDFs. We take the firing rates of the neurons representing
a given random variable $X_\alpha$ to be functions of the
mean value $\bar{x}_\alpha(t)$ (Fig. 2),

$$a_i^\alpha(t) = g\bigl(A_i^\alpha \bar{x}_\alpha(t) + B_i^\alpha\bigr) , \tag{3}$$

where $A_i^\alpha$ and $B_i^\alpha$ are parameters describing the response
properties of neuron $i$ of the population representing
random variable $X_\alpha$. The activation function $g$ is in general
nonlinear; in this work, we take $g$ to be the logistic
function,

$$g(x) = \frac{1}{1 + \exp(-x)} . \tag{4}$$



We can make use of (3) to directly encode mean values
into neural activation states, providing a means to specify
the value of the evidence nodes in the NBN.
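
Encoding a mean value into a population of firing rates amounts to one logistic evaluation per neuron. The sketch below illustrates this with a small population; the gains and biases in `GAINS` and `BIASES` are hypothetical values chosen for illustration, and the function names `logistic` and `encode` are not taken from the text.

```python
import math


def logistic(x):
    """Logistic activation function g."""
    return 1.0 / (1.0 + math.exp(-x))


def encode(mean, gains, biases):
    """Firing rates a_i = g(A_i * mean + B_i) for one population."""
    return [logistic(a * mean + b) for a, b in zip(gains, biases)]


# Hypothetical response parameters A_i and B_i for four neurons.
GAINS = [4.0, 2.0, -2.0, -4.0]
BIASES = [0.0, 0.5, 0.5, 0.0]

# Encode the evidence value 1/2 into the activities of the population.
rates = encode(0.5, GAINS, BIASES)
print([round(r, 4) for r in rates])
```

Neurons with positive gain increase their firing rate with the encoded mean value and neurons with negative gain decrease it, so a population with mixed gains covers the range of the variable.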

Using (3), we derive an update rule describing the
neuronal dynamics, obtaining (to first order in $\tau$)

$$a_i^\alpha(t + \tau) = g\Bigl(A_i^\alpha \bar{x}_\alpha(t) + \tau A_i^\alpha \frac{d\bar{x}_\alpha(t)}{dt} + B_i^\alpha\Bigr) . \tag{5}$$

Thus, if we can determine how $\bar{x}_\alpha$ changes with time, we
can directly determine how the neural activation states
change with time.

The mean value $\bar{x}_\alpha(t)$ can be determined from the
firing rates as the expectation value of the random variable
$X_\alpha$ with respect to a PDF $\rho(x_\alpha; t)$ represented in terms
of some decoding functions $\{\phi_i^\alpha(x_\alpha)\}$. The PDF is
recovered using the relation

$$\rho(x_\alpha; t) = \sum_i a_i^\alpha(t) \phi_i^\alpha(x_\alpha) . \tag{6}$$

The decoding functions are constructed so as to minimize
the difference between the assumed and reconstructed
PDFs (discussed in detail elsewhere).

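With representations as given in (6), we have

$$\bar{x}_\alpha(t) = \int x_\alpha \rho(x_\alpha; t)\, dx_\alpha = \sum_i a_i^\alpha(t) \bar{x}_i^\alpha , \tag{7}$$

where we have defined

$$\bar{x}_i^\alpha = \int x_\alpha \phi_i^\alpha(x_\alpha)\, dx_\alpha . \tag{8}$$

Although we used the decoding functions $\phi_i^\alpha(x_\alpha)$ to
calculate the parameters $\bar{x}_i^\alpha$, they can in practice be found
directly so that the relations in (3) and (7) are mutually
consistent.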

We take the PDFs $\rho(x_\alpha; t)$ to be normally
distributed with the form $\rho(x_\alpha; t) = \rho(x_\alpha; \bar{x}_\alpha(t)) =
N(x_\alpha; \bar{x}_\alpha(t), \sigma_\alpha^2)$. Intuitively, we might expect that the
variance should be small so that the mean value is
coded precisely, but we will see that the variances have
no significance in the resulting neural networks.

The second assumption we make is that interactions
between the nodes are linear:

$$x_\beta = \sum_\alpha K_{\beta\alpha} x_\alpha . \tag{9}$$

Utilizing the causality relations given by the Bayesian
belief network, we require that $K_{\beta\alpha} \neq 0$ only if $X_\beta$ is a child
node of $X_\alpha$ in the network graph. To represent the linear
interactions as a probabilistic model, we take the normal
distributions $p(x_\beta \mid \mathrm{Pa}(x_\beta)) = N(x_\beta; \sum_\alpha K_{\beta\alpha} x_\alpha, \sigma_\beta^2)$ for
the conditional probabilities.

For nodes in the BBN which have no parents, the
conditional probability $p(x_\beta \mid \mathrm{Pa}(x_\beta))$ is just the prior
probability distribution $p(x_\beta)$. We utilize the same rule to
define the prior probabilities as to define the conditional
probabilities. For parentless nodes, the prior is thus
normally distributed with zero mean, $p(x_\beta) = N(x_\beta; 0, \sigma_\beta^2)$.

We use the relative entropy as a measure of the
"distance" between the joint distribution describing the
probabilistic model $p(x_1, x_2, \ldots, x_D)$ and the PDF
estimated from the neural network $\rho(x_1, x_2, \ldots, x_D)$. Thus,
we minimize

$$E = -\int \rho(x_1, x_2, \ldots, x_D) \log\left(\frac{p(x_1, x_2, \ldots, x_D)}{\rho(x_1, x_2, \ldots, x_D)}\right) dx_1\, dx_2 \cdots dx_D \tag{10}$$

with respect to the mean values $\bar{x}_\alpha$. By making use of
the gradient descent prescription with rate constant $\eta$,

$$\frac{d\bar{x}_\gamma}{dt} = -\eta \frac{\partial E}{\partial \bar{x}_\gamma} , \tag{11}$$



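and the decomposition property for BBNs given by (1),
we obtain the update rule for the mean values,

$$\frac{d\bar{x}_\gamma}{dt} = \eta \left[ \sum_\beta \frac{K_{\beta\gamma}}{\sigma_\beta^2} \Bigl( \bar{x}_\beta - \sum_\alpha K_{\beta\alpha} \bar{x}_\alpha \Bigr) - \frac{1}{\sigma_\gamma^2} \Bigl( \bar{x}_\gamma - \sum_\alpha K_{\gamma\alpha} \bar{x}_\alpha \Bigr) \right] . \tag{12}$$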
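
The reduction of (10) to a function of the mean values can be sketched explicitly. The following assumes that the network estimate is the product of normal marginals with means $\bar{x}_\alpha$ and variances $\sigma_\alpha^2$, that the model is the product of the linear normal conditionals introduced above, and that $E$ is the relative entropy of the estimate with respect to the model; $C$ is a symbol introduced here for the terms that depend only on the variances.

```latex
E = \sum_{\beta} \frac{1}{2\sigma_\beta^2}
      \Bigl( \bar{x}_\beta - \sum_{\alpha} K_{\beta\alpha} \bar{x}_\alpha \Bigr)^2 + C ,
\qquad
\frac{\partial E}{\partial \bar{x}_\gamma}
  = \frac{1}{\sigma_\gamma^2}
      \Bigl( \bar{x}_\gamma - \sum_{\alpha} K_{\gamma\alpha} \bar{x}_\alpha \Bigr)
  - \sum_{\beta} \frac{K_{\beta\gamma}}{\sigma_\beta^2}
      \Bigl( \bar{x}_\beta - \sum_{\alpha} K_{\beta\alpha} \bar{x}_\alpha \Bigr) .
```

Under these assumptions, only the parents of $X_\gamma$ enter the first term of the gradient, while the children of $X_\gamma$ and the parents of those children enter the second.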



Because the coupling parameters $K_{\beta\alpha}$ are nonzero only
when $X_\alpha$ is a parent of $X_\beta$, generally only a subset of the
mean values contributes to updating $\bar{x}_\gamma$ in (12). In terms
of the belief network graph structure, the only contributing
values come from the parents of $X_\gamma$, the children of
$X_\gamma$, and the parents of the children of $X_\gamma$; this is identical
to the Markov blanket discussed in section II.
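
A gradient-descent update of this kind can be sketched at the level of mean values, without the neural encoding. The code below is an illustration under stated assumptions rather than the rule (12) itself: all variances are set to one, the couplings are stored in a hypothetical dictionary `K` keyed by (child, parent), the function name `relax` is invented here, and the flag `zero_mean_prior` switches the prior term on parentless nodes on or off.

```python
def relax(K, values, evidence, zero_mean_prior=True, eta=0.1, steps=2000):
    """Gradient descent on the mean values of the non-evidence nodes.

    K[(b, a)] is the coupling from parent a to child b, `values` maps each
    node to its initial mean value, and `evidence` is the set of clamped
    nodes. Unit variances are assumed throughout.
    """
    nodes = list(values)
    has_parents = {b for (b, _a) in K}
    x = dict(values)
    for _ in range(steps):
        # Residual of each linear relation: x_b - sum_a K_ba * x_a.
        resid = {}
        for b in nodes:
            if b in has_parents or zero_mean_prior:
                resid[b] = x[b] - sum(k * x[a] for (c, a), k in K.items() if c == b)
            else:
                resid[b] = 0.0  # parentless node with the prior term switched off
        new_x = dict(x)
        for g in nodes:
            if g in evidence:
                continue  # evidence nodes keep their assigned values
            # The node's own residual, minus the residuals of its children
            # weighted by the corresponding couplings.
            grad = resid[g] - sum(k * resid[b] for (b, a), k in K.items() if a == g)
            new_x[g] = x[g] - eta * grad
        x = new_x
    return x


# Three-node graph in which X2 is the parent of both X1 and X3.
copy_net = {(1, 2): 1.0, (3, 2): 1.0}
copied = relax(copy_net, {1: 0.0, 2: 0.25, 3: 0.0}, evidence={2})
averaged = relax(copy_net, {1: -0.25, 2: 0.0, 3: 0.5}, evidence={1, 3},
                 zero_mean_prior=False)
shrunk = relax(copy_net, {1: -0.25, 2: 0.0, 3: 0.5}, evidence={1, 3})
print(copied, averaged[2], shrunk[2])
```

In this sketch, only the residuals of a node, its children, and (through those residuals) its parents and the other parents of its children enter the update, which mirrors the Markov blanket structure described above. With the prior term switched off, inconsistent evidence on the two children settles at their average; with a unit-variance zero-mean prior, the result is pulled toward zero.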

The update rule for the neural activities is obtained by
combining (5), (7), and (12), resulting in

[Eq. (13): not recoverable from the extracted text]

The quantity $\sum_j S_{ij}^\gamma a_j^\gamma(t) + B_i^\gamma$ serves to stabilize the
activities of the neurons representing $\rho(x_\gamma)$ (similar to
neural integrator models), while

[Eq. (14): not recoverable from the extracted text]

drives changes in $a_i^\gamma(t)$ based on the PDFs represented
by other nodes of the BBN. The synaptic weights of the
neural network are

[Eqs. (15)-(19): not recoverable from the extracted text]



The foregoing provides an algorithm for generating and
evaluating neural networks that process mean values of
random variables. To summarize,

1. Establish independence relations between model 
variables. This may be accomplished by using a 
graph to organize the variables. 

2. Specify the $K_{\beta\alpha}$ to quantify the relations between
the variables.

3. Assign network inputs by encoding desired values
into neural activities using (3).

4. Update other neural activities using (13).

5. Extract the expectation values of the variables from
the neural activities using (7).



IV. APPLICATIONS 

As a first example, we apply the algorithm to the BBN
shown in Fig. 1 with firing rate profiles as shown in
Fig. 2. Specifying $\bar{x}_1 = 1/2$ and $\bar{x}_2 = -1/2$ as evidence,
we find an excellent match between the mean values
calculated by the neural network and the directly calculated
values for the remaining nodes (Table I).

We next focus on some simpler BBNs to highlight
certain properties of the resulting neural networks (which
will again utilize the firing rate profiles shown in Fig. 2).
In Fig. 3, we present two BBNs that relate three random
variables in different ways. The connection strengths are
all taken to be unity in each graph, so that $K_{21} = K_{23} =
K_{12} = K_{32} = 1$.

With the connection strengths so chosen, the two
BBNs have straightforward interpretations. For the
graph shown in Fig. 3(a), $X_2$ represents the sum of $X_1$ and
$X_3$, while, for the graph shown in Fig. 3(b), $X_2$ provides
a value which is duplicated in $X_1$ and $X_3$. The different
graph structures yield different neural networks; in
particular, nodes $X_1$ and $X_3$ have direct synaptic connections
in the neural network based on the graph in
Fig. 3(a), but no such direct weights exist in a second network
based on Fig. 3(b). Thus, specifying $\bar{x}_1 = -1/4$ and
$\bar{x}_2 = 1/4$ for the first network produces the expected result
$\bar{x}_3 = -0.5000$, but specifying $\bar{x}_2 = 1/4$ in the second
network produces $\bar{x}_3 = 0.2500$ regardless of the value (if
any) assigned to $\bar{x}_1$.

To further illustrate the neural network properties, we
use the graph shown in Fig. 3(b) to process inconsistent
evidence. Nodes $X_1$ and $X_3$ should copy the value in
node $X_2$, but we can specify any values we like as network
inputs. For example, when we assign $\bar{x}_1 = -1/4$ and
$\bar{x}_3 = 1/2$, the neural network yields $\bar{x}_2 = 0.1250$ for the
remaining value. This is a typical and reasonable result,
matching the least-squares solution to the inconsistent
problem.
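
The least-squares reading of this result can be checked by hand. The following is a sketch under assumptions not stated in the text: equal variances for the two conditional distributions, and no contribution from the zero-mean prior on $X_2$. With $K_{12} = K_{32} = 1$, the mismatch between the two evidence values and the value they should copy is

```latex
E(\bar{x}_2) = (\bar{x}_1 - \bar{x}_2)^2 + (\bar{x}_3 - \bar{x}_2)^2 ,
\qquad
\frac{dE}{d\bar{x}_2} = 0
\;\Longrightarrow\;
\bar{x}_2 = \frac{\bar{x}_1 + \bar{x}_3}{2}
          = \frac{-1/4 + 1/2}{2} = 0.125 ,
```

which agrees with the quoted value of 0.1250.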

V. CONCLUSION 

We have introduced a class of neural networks that
consistently mix multiple sources of evidence. The networks
are based on probabilistic models, represented in
the graphical form of Bayesian belief networks, and function
based on traditional neural network dynamics (i.e., a
weighted sum of neural activation values passed through
a nonlinear activation function). We constructed the networks
by restricting the represented probabilistic models
through two auxiliary assumptions.

First, we assumed that only the mean values of the ran- 
dom variables need to be accurately represented, with 
higher order moments of the distribution being unim- 
portant. We introduced neural representations of rele- 
vant probability density functions consistent with this 
assumption. Second, we assumed that the random vari- 
ables of the probabilistic model are linearly related to one 
another, and chose appropriate conditional probabilities 
to implement these linear relationships. 

Using the representations suggested by our auxiliary 
assumptions, we derived a set of update rules by min- 
imizing the relative entropy of an assumed PDF with 
respect to the PDF decoded from the neural network. 
In a straightforward fashion, the optimization procedure 
yields neural weights and dynamics that implement spec- 
ified probabilistic relations, without the need for a train- 
ing process. 

The restricted class of neural belief networks inves- 
tigated in this work captures many of the properties 
of both Bayesian belief networks and neural networks. 
In particular, multiple sources of evidence are consis- 
tently pooled based on local update rules, providing a 
distributed version of a probabilistic model. 

FIG. 1: A Bayesian belief network. Evidence about any of
the random variables influences the likelihood of, or belief in,
the remaining random variables. In a straightforward
terminology, the node at the tail of an arrow is a parent of the
child node at the head of the arrow, e.g. $X_4$ is a parent of
$X_5$ and a child of both $X_2$ and $X_3$. From the structure of the
graph, we can see the conditional independence relations in
the probabilistic model. For example, $X_5$ is independent of
$X_1$ and $X_2$ given $X_3$ and $X_4$.

FIG. 2: The mean values of the random variables are encoded
into the firing rates of populations of neurons. A population of
twenty neurons with piecewise-linear responses is associated
with each random variable. The neuronal responses $a_i$ are
fully determined by a single input, which we interpret as
the mean value of a PDF. The form of the neuronal transfer
functions can be altered without affecting the general result
presented in this work.



TABLE I: The mean values decoded from the neural network
closely match the values directly calculated from the
linear relations. The coefficients for the linear combinations
were randomly selected, with values $K_{31} = -0.2163$, $K_{32} =
-0.8328$, $K_{42} = 0.0627$, $K_{43} = 0.1438$, $K_{53} = -0.5732$, and
$K_{54} = 0.5955$.

FIG. 3: Simpler BBNs. Although the underlying undirected
graph structure is identical for these two networks, the
directions of the causality relationships between the variables are
reversed. The neural networks arising from the BBNs thus
have different properties.



